Abstract

Producing gene fusions through genomic structural rearrangements is a major mechanism for tumor evolution. 
Therefore, accurately detecting gene fusions and the originating rearrangements is of great importance for 
personalized cancer diagnosis and targeted therapy. We present a tool, BreakTrans, that systematically maps 
predicted gene fusions to structural rearrangements. Thus, BreakTrans not only validates both types of predictions, 
but also provides mechanistic interpretations. BreakTrans effectively validates known fusions and discovers novel 
events in a breast cancer cell line. Applying BreakTrans to 43 breast cancer samples in The Cancer Genome Atlas 
identifies 90 genomically validated gene fusions. BreakTrans is available at
http://bioinformatics.mdanderson.org/main/BreakTrans



Rationale 

Many cancers are driven by pathogenic expression of 
mRNA fusion transcripts produced by genomic structural 
rearrangements (GSRs) in tumor cells. Classic examples 
include BCR-ABL1 in chronic myelogenous leukemia
[1], PML-RARa in acute promyelocytic leukemia [2], and 
TMPRSS2-ERG in prostate cancer [3]. These fusions can 
arise from not only simple translocations of two distal 
genomic loci [4] but also complex GSRs that involve 
multiple distal loci [5-8]. Accurately identifying these 
pathogenic transcripts and the originating GSRs will have 
a major impact in personalized cancer diagnosis and 
targeted therapy [4,9]. 

Since 2008, next generation sequencing (NGS) tech- 
nologies have been applied to identify GSR breakpoints 
and gene fusions. Many bioinformatics tools such as 
BreakDancer [10], VariationHunter [11], and CREST 
[12] have been developed to detect GSRs from whole 
genome sequencing (WGS) data. These tools predict 
individual genomic breakpoints by searching for clusters
of abnormally mapped reads. Although generally useful,
they often produce an appreciable number of false posi- 
tives and false negatives introduced by insufficient cov- 
erage, short insert size, misaligned reads, GC content 
bias, base-calling errors, and repeats [13]. Limitations in
data quality and the complexity of rearrangements make
it a challenging task to infer the structure of complex 
GSRs (or so-called genome architecture) from predicted 
individual breakpoints [14,15]. Meanwhile, many tools 
such as Tophat-fusion [16], deFuse [17], MapSplice [18], 
and BreakFusion [19] have been developed to detect 
gene fusions from whole transcriptome sequencing 
(WTS) data. These tools are algorithmically similar to 
their genomic counterparts, although they have more 
emphasis on mapping and ascertaining novel sequence 
junctions produced by mRNA-splicing and are more 
robust in modeling the coverage (expression). Again, these 
tools are associated with various types of false positives 
and false negatives [20] and often do not have good 
concordance. 

When both WGS and WTS data are available, we can 
compare them to identify GSRs that lead to gene fusions. 
Because of the technical independence of these two data
sources, their comparison can serve as a form of validation.
In addition to improving results, mapping fusions to
GSRs also elucidates the mechanistic origins of these
fusions and their potential clinical value. However, such
analysis is complicated by several factors. First, because of 
mRNA splicing, the genomic breakpoints responsible for a 
fusion may not be located near the fusion boundaries. 
Second, a fusion may be produced via multiple genomic 
breakpoints that join segments from distal regions of the 
genome. Several types of such complex GSRs have been 
recently revealed by WGS in various cancer types 
[5-7,21,22]. Third, not all GSRs produce new genes that 
can be transcribed. The properties (for example, location, 
type, and strand) of individual GSR breakpoints and the 
potential of producing valid open reading frames from 
existing genes need to be accounted for so as to produce 
biologically meaningful results. Fourth, current NGS data 
have limited power to accurately determine the genomic 
architectures of underlying alleles [23]. The technological 
limitations in resolving repeats and phase and the lack of 
physical coverage make it difficult to derive correct results. 

Systematic approaches are needed to address these challenges.
Recently, two bioinformatics tools, Comrad [24] and nFuse [25],
were developed for this purpose. Both tools align raw WGS and
WTS reads while simultaneously corroborating fusions
and GSRs. As an early effort, Comrad maps only a single
fusion breakpoint to a single genomic breakpoint
by applying a set of ad hoc rules. As an
update, nFuse maps fusion breakpoints to complex
GSRs using a graph-theoretic approach. A design advantage
of these tools is that they can account for ambiguous
read alignment and therefore potentially minimize
errors caused by misalignments. However, Comrad was
only able to analyze low-pass WGS data, which have limited
power for discovering GSRs. Moreover, the self-contained
design of both tools prevents them from examining hypotheses
produced by other well-attested algorithms such as
Tophat-fusion, MapSplice, BreakDancer and CREST.

To overcome these limitations, a modularly designed 
tool that focuses on mapping fusions to GSRs without 
re-performing breakpoint discovery may better serve the 
analytical demand and utilize existing resources. In this 
paper, we present such a bioinformatics tool, Break- 
Trans, that integrates the results of various fusion and 
GSR prediction algorithms and returns a set of genomi- 
cally validated fusions with their originating alleles. 

Results 

Overview of BreakTrans 

BreakTrans is designed to map gene fusions predicted 
by a set of fusion prediction programs, such as deFuse, 
MapSplice, Tophat-fusion and BreakFusion, to GSR 
breakpoints predicted by a set of GSR prediction algo- 
rithms, such as BreakDancer, CREST, and Variation- 
Hunter (Figure 1; Materials and methods). BreakTrans 
includes four major steps: 1) parse and read in GSR and 
fusion breakpoints produced by front-end tools; 2) con- 
struct a genomic breakpoint graph from GSR break- 
points; 3) search for genomic alleles (paths in the 
breakpoint graph) that support fusion hypotheses; and 
4) output validated fusions and associated genomic 
alleles. 

Figure 1 Schematic overview of BreakTrans. Plotted as an example are three genes, A, B and C, that range from genomic positions (black
nodes) a to c, d to g, and h to j, respectively. Each gene contains two exons (arrow boxes) that can be transcribed from 5' to 3'. Gene A is on
the positive (+) strand, while genes B and C are on the negative (-) strand. Two sets of putative novel genomic breakpoints are identified from
alignments: b+

Cell-line SK-BR-3

We applied BreakTrans to study the genome and transcriptome
of SK-BR-3, a breast cancer cell line. We
downloaded WTS data from the NCBI Sequence Read
Archive [SRA:SRP003186]. We collected fusion breakpoints
from three different sources. First, we analyzed
the WTS data using Tophat-fusion-0.1.0 (beta) and
obtained 27 fusion breakpoints. Second, we analyzed the
data using BWA [26] and BreakDancer [10] with NCBI 
human assembly build 36 as the reference. From this 
analysis, we obtained 2,065 putative fusion breakpoints 
that contained 6 of the 10 known fusion genes in SK- 
BR-3 (Table 1) [27]. To further increase sensitivity, we 
included 28 Tophat-fusion breakpoints and 1,395 deFuse 
breakpoints that were previously published using the 
same set of WTS data [28]. This set included seven 
known fusion genes. Altogether, 3,498 unique fusion 
breakpoints were obtained (Additional file 1) that 
included 7 of 10 known fusion genes. 

To obtain a set of genomic breakpoints, we generated 
80-fold 100 bp paired-end WGS reads (Illumina) from 
genomic DNA [SRA:SRP028176]. We mapped the WGS 
reads against build 36 reference using BWA and per- 
formed BreakDancer and CREST [29] analysis. Break- 
Dancer and CREST predicted 23,567 (>1,000 bp) and 
18,048 genomic breakpoints, respectively. Altogether, 
41,615 unique genomic breakpoints were obtained 
(Additional file 2). 

We ran BreakTrans-0.0.6 on these two sets of fusion 
and genomic breakpoints and obtained a set of 40 
redundant fusion breakpoints that are supported by 
genomic alleles (Additional file 3). These fusion break- 
points are redundant (in location) due to our inclusion 
of multiple sources at variable nucleotide resolutions. 
Altogether, these 40 breakpoints nominated 8 unique 
fusion genes (Table 1), including 6 of the 10 known 
fusion genes and 2 novel ones. 

Of the four known fusion genes that we missed, DHX35-ITCH
and NFS1-PREX1 were likely missed because of insufficient
coverage of the transcriptome, as indicated by a previous
study [28]. CYTH1-EIF3H was missed because of insufficient coverage
of the genome: neither BreakDancer nor CREST detected 
any genomic rearrangements that can be associated with 
this fusion. Although the WGS data we used have great 
sequence coverage (80-fold), their physical coverage is 
quite limited: the average insert size is only 211 bp with 
a read length of 100 bp. CSE1L-ENSG00000236127 
has become obsolete because of the exclusion of 
ENSG00000236127 from the Ensembl database, as pre- 
viously explained [28]. BreakTrans was able to validate all 
known fusions with sufficient coverage from this dataset, 
indicating its high sensitivity. 

For comparison purposes, we ran nFuse-0.1.4 on the 
same WTS and WGS datasets using default parameters. 
Among the 1,994 predicted fusion breakpoints (Additional 
file 4), only 2 of the known fusion genes (ANKHD1-PCDH1
and SUMF1-LRRFIP2) were identified.

The two novel fusion genes PREX1-CPNE1 and MTBP-SAMD12
detected by BreakTrans were both nominated
by deFuse and are both likely to be valid. PREX1, CPNE1,
and SAMD12 have each been found fused with other genes
in breast cancer cell lines: NFS1-PREX1 in SK-BR-3,
CPNE1-PI3 in BT-474 [27], and PHF20L1-SAMD12 in
HCC1954 [25,30]. The PREX1-CPNE1 fusion occurs precisely
at the known exon boundaries (Figure S1 in Additional
file 5). Genomic regions containing these genes
underwent substantial copy number alterations (CNAs)
(Figure 2), which have been shown to co-occur with gene
fusions [27]. Furthermore, both PREX1 and MTBP have
been previously implicated in breast cancer progression
[31-33].



Table 1 Fusion genes and breakpoint paths predicted by BreakTrans-0.0.6

Number  Fusion genes             Breakpoint paths
1       ANKHD1-EIF4EBP3>PCDH1    5:139807117+65|5:141217466-
2       CCDC85C>SETD3            14:99059254-7|14:98966917-
3       RARA>PKIA                17:35727917+45|8:79637984-
4       SUMF1>LRRFIP2            3:4338455-34|3:37158400-
5       TATDN1>GSDMB             8:125618280-93|17:35321200-
6       WDR67>ZNF704             8:81882283-4|8:81882470+0|8:81916434+18|8:124171162-0|8:124158930-36|8:124161970+
                                 8:124171162+18|8:81916434-
7       MTBP>SAMD12              8:121547851+33|8:119503797+0|8:119661057+2|8:119666167-0|8:119662603-33|8:118985543+0|8:118990300+28|8:118992237-0|8:118985543-33|8:119662603+0|8:119666167+2|8:119661057-
8       PREX1>CPNE1              20:46795673-17|20:33925625-0|20:33923847-18|20:33679982-
9       DHX35-ITCH
10      NFS1-PREX1
11      CYTH1-EIF3H
12      CSE1L-ENSG00000236127


Eight gene fusions were predicted by BreakTrans-0.0.6, including 6 previously known fusions (1 to 6) and 2 novel ones (7 and 8). Four previously known fusions
(9 to 12) were not predicted due primarily to lack of coverage. For each predicted fusion, at least one underlying genomic allele is found, represented as a breakpoint path
that consists of a series of breakpoint strings (Materials and methods). '>' represents the predicted (5' to 3') order of a gene fusion.

Figure 2 Copy number profile in the SK-BR-3 genome. Plotted are three gene fusions predicted by BreakTrans: (a) PREX1>CPNE1,
(b) MTBP>SAMD12, (c) WDR67>ZNF704. The x-axis represents genomic positions and the y-axis represents absolute copy number in
non-overlapping 10 kb windows. The vertical red lines mark the locations of the GSR breakpoints that led to these fusions.



Figure 3 Breakpoints of novel fusions in SK-BR-3. (a) The novel tumor allele supporting PREX1>CPNE1 consists of the first three exons of
PREX1 (green), an intronic segment of PHF20 (orange) and the last three exons of CPNE1 (blue). The RNA PREX1>CPNE1 breakpoint between
exons A and B was nominated by deFuse, while the genomic breakpoints SV1 and SV2 were detected by CREST. (b) Six genomic breakpoints
from three gene fusions were selected for PCR validation (Additional file 8; lane labels: ladder, SV1 to SV6). SV1 and SV2 are from PREX1>CPNE1,
SV3 and SV4 from MTBP-SAMD12, and SV5 and SV6 from WDR67-ZNF704. Clean PCR bands were observed at SV1, SV2, SV3 and SV6.

These eight fusion genes were supported by nine unique
alleles, as shown by the breakpoint paths in Table 1. Six
of the nine alleles contain one unique genomic breakpoint,
representing the simplest way of generating a
fusion. The allele that encodes PREX1-CPNE1 contains
two breakpoints, which connect DNA segments from
three different genes on chromosome 20. Included are
the first three exons of PREX1, an intronic segment of
PHF20, and the last three exons of CPNE1 (Figure 3a).
These breakpoints are highly supported by WGS data:
17 soft-clipped reads were found at the PREX1-PHF20
breakpoint and 18 at the PHF20-CPNE1 breakpoint
(Figures S2 to S5 in Additional file 5). They also overlap
precisely with CNA boundaries (Figure 2a). The MTBP- 
SAMD12 allele contains five unique breakpoints, all 
located in a single complex amplicon on chromosome 8. 
At least three breakpoints clearly coincide with CNA boundaries
(Figure 2b) and are supported by clusters of soft-clipped reads
(Figures S6 to S9 in Additional file 5). The
breakpoint path indicates an inverted duplication, a type
of genomic rearrangement that has been commonly
observed in breast cancer cell lines [21]. The WDR67-ZNF704
fusion was supported by two different alleles,
containing three breakpoints and one breakpoint,
respectively. These breakpoints also coincide with the boundaries
of two distal amplicons on chromosome 8 (Figure 2c) and are
supported by soft-clipped reads (Figures S10 and S11 in
Additional file 5).
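
The breakpoint counts quoted above can be read directly off the breakpoint paths in Table 1. The following is a minimal Python sketch of that bookkeeping, not part of BreakTrans itself: the helper names parse_path and count_breakpoints are hypothetical, and the sketch assumes that each node in a path is written as chromosome:position, a strand, and (for all but the last node) the score of the edge leading to the next node, with a score of 0 marking a reference (null) edge.

```python
import re

# One node of a breakpoint path: chrom:pos, strand, optional score of the
# edge that leaves this node (the last node of a path carries no score).
_NODE = re.compile(r"^(\w+):(\d+)([+-])(\d*)$")


def parse_path(path):
    """Split a breakpoint path into (chrom, pos, strand, score) tuples."""
    nodes = []
    for token in path.split("|"):
        match = _NODE.match(token.strip())
        if match is None:
            raise ValueError(f"not a valid path node: {token!r}")
        chrom, pos, strand, score = match.groups()
        nodes.append((chrom, int(pos), strand, int(score) if score else None))
    return nodes


def count_breakpoints(path):
    """Count the non-null edges (score > 0) along a breakpoint path."""
    nodes = parse_path(path)
    return sum(1 for _chrom, _pos, _strand, score in nodes[:-1] if score)


if __name__ == "__main__":
    prex1_cpne1 = "20:46795673-17|20:33925625-0|20:33923847-18|20:33679982-"
    print(count_breakpoints(prex1_cpne1))  # 2
```

Under these assumptions the PREX1>CPNE1 path yields two breakpoints (scores 17 and 18) joined by one reference edge, in line with the soft-clipped read counts reported in the text.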

To validate these novel fusion breakpoints, we generated
two independent paired-end RNA-seq datasets
(SKBR3-1 and SKBR3-2; 76 bp read length) using the
SK-BR-3 lines in our lab [SRA:SRP028176]. Both
PREX1-CPNE1 and MTBP-SAMD12 were rediscovered
at identical breakpoints using Tophat-fusion and BreakFusion,
together with nine of the previously known fusions
(Additional files 6 and 7). Note that both novel fusions
were originally nominated using publicly available RNA-seq 
data (50 bp read length) [SRA:SRP003186] by deFuse [28], 
which employs alignment and fusion-calling algorithms 
very different from either Tophat-fusion or BreakFusion. 
Such independence in the data and in the analytical 
approaches supports both novel fusions predicted by Break- 
Trans as being real biological events. Interestingly, we also 
re-identified the genomic PREX1-PHF20 breakpoint in one
of the RNA-seq datasets (SKBR3-2) (Additional file 7), 
which validated the existence of this breakpoint in the 
pre-mRNAs. 

We further validated a set of associated genomic
breakpoints using PCR (Additional file 8). If these genomic
breakpoints were real, we should be able to observe
PCR bands at the expected DNA amplicon sizes. Out of the
six breakpoints that we were able to design primers for,
four amplicons had very clean bands (Figure 3b), which
included both breakpoints for PREX1-CPNE1,
one for MTBP-SAMD12 and one for WDR67-ZNF704.
Interestingly, two different PCR bands were
observed at one of the WDR67-ZNF704 breakpoints 
(Figure 3b), consistent with our prediction that the 
WDR67-ZNF704 fusion is associated with two different 
genomic alleles. 

BreakTrans analysis of The Cancer Genome Atlas breast 
cancer WGS and WTS datasets 

We applied BreakTrans to 43 breast cancer samples with
both WTS and WGS data from The Cancer Genome
Atlas (TCGA; 12/30/2012) [34]. mRNA fusion breakpoints
were nominated by BreakDancer by identifying
clusters of read pairs that span different genes from the
WTS BAM files produced by MapSplice [18]. The genomic
breakpoints were detected by two programs, BreakDancer
and SquareDancer (K Chen et al., unpublished),
which examine discordant read pairs and soft-clipped
reads, respectively. Together, we obtained a set of
156,955 redundant mRNA fusion breakpoints (an average
of 3,650 per sample) and another set of 305,743 genomic
breakpoints (an average of 6,794 per sample). We applied
BreakTrans to these sets of breakpoints in conjunction
with gene models specified in TCGA Genome Annotation
Format (GAF) version 2.1, provided by the University
of California Santa Cruz.

BreakTrans identified 177 redundant fusion breakpoints 
with convincing genomic evidence, which corresponded to 
90 unique sample gene pairs (Additional file 9). 

None of the fusions was found to be recurrent with
identical gene pairs, suggesting a high level of heterogeneity
in breast cancer, as consistently demonstrated by previous
studies [35]. However, we found a set of genes that
recurrently partnered with others: CBX3, C15orf57,
BCAS3, RARA, USP15, PTPRN2, USP32, FBXL20, SNX27,
WIPF2, NF1 and RAD51C. Notably, the USP family members
(USP13, USP15, and USP32) were frequently involved
(in five fusions). Several fusions involved a kinase at the 3'
end and are potentially viable therapeutic targets: USP13-PIK3CA,
GPR160-PRKCI, and FBXL20-TLK2. Among the
43 samples, 33 were found to have more than 2 gene
fusions, with one (A09I) containing 10 fusion genes.
Sample A09I also demonstrated extensive genomic
instability with many CNAs, including focal amplification
of over 60-fold (Figure S12 in Additional file 5).

Most of the 90 fusions (83.3%) involved one genomic
breakpoint and 2 distal loci. The rest involved multiple
genomic breakpoints. For example, the NF1-NLE1
fusion in A09I involved two breakpoints and three genes
(NF1, CAO and NLE1; Figures S13 to S18 in Additional
file 5), the PPP1R1B-PIPOX fusion in A0D1 involved
three breakpoints and three genes (PPP1R1B, NOS2 and
PIPOX; Figures S19 to S27 in Additional file 5), and the
PPP3R1-TTC27 fusion in A0YG involved three breakpoints
and four genes (PPP3R1, USP34, LTBP1 and
TTC27; Figures S28 to S37 in Additional file 5). Both
NF1-NLE1 and PPP1R1B-PIPOX occurred on the
chr17q hotspot with GSR boundaries precisely overlapping
CNA boundaries (Figures S12, S14, S20 and S21 in
Additional file 5). PPP3R1-TTC27 occurred on chromosome 2
and was evidently associated with chromothripsis
(Figures S29 to S31 in Additional file 5) [6].

To assess the validity of BreakTrans predictions, we performed
PCR validation on 20 genomic breakpoints (Additional
file 10), including 9 that were associated with the
above 3 multi-breakpoint fusions and 11 that we randomly
selected from 9 samples. Out of these 20 breakpoints, 15
were validated as somatic, 1 as germline, and 4 as wild
type (Additional file 11 and Figures S38 to S41 in Additional
file 5). Further capillary sequencing of the PCR products
confirmed the existence of one more breakpoint
(Figure S42 in Additional file 5). Among the validated
breakpoints were both breakpoints underlying
NF1-NLE1, all three breakpoints underlying
PPP3R1-TTC27, and two of the three breakpoints underlying
PPP1R1B-PIPOX.

Discussion 

In this work, we present a novel bioinformatics approach, 
BreakTrans, that systematically maps detected gene 
fusions to novel genomic alleles produced by GSRs, 
thereby validating both sets of hypotheses and providing 
mechanistic interpretation to validated fusions. Our analy- 
sis and experimental validation indicated very high specifi- 
city of BreakTrans. The true specificity is likely higher 
than our estimation (60 to 80%), given the difficulties in 
performing PCR validation in repetitive regions.

Our results indicated that BreakTrans could achieve 
higher sensitivity through integration of multiple predic- 
tors without demonstrably increasing false positive rate. 
This is particularly important for current practice as indi- 
vidual predictors tend to be conservatively configured to 
achieve individually low false positive rates at the cost of 
increasing false negative rates. This phenomenon is parti- 
cularly evident in our SK-BR-3 analysis, where we 
observed a large proportion of calls unique to a predictor. 
Conventional strategies that summarize results based on
majority rules have been shown to be helpful in reducing
false positives [36]. However, the further loss in sensitivity
was usually not characterized. Applying BreakTrans to
integrate multiple call sets is clearly a different and more
effective strategy, as it integrates additional data. Indeed,
the two novel fusions in the SK-BR-3 set were only nominated
by deFuse and would have been eliminated if a
simple consensus approach were taken. Our modular
design allows users to utilize their favorite predictors and
include hypotheses from any source (for example, the literature).
This feature relieves users from trying to determine
the best predictors and post-processing strategies for their
data, a non-trivial task.

Another contribution of our work is that we proposed a 
convention (breakpoint string) to represent individual 
breakpoints and breakpoint graphs, as well as simple or 
complex alleles that encompass one or more breakpoints. 
This allows the reporting and communicating of large 
numbers of complex hypotheses in a concise and accurate 
way, an important requirement for large-scale sequencing 
and clinical sequencing efforts [37]. It also relieves 
researchers from manually piecing together alleles from 
individual breakpoints, a complex and error-prone task. 

Our current version does not contain a scoring system 
to characterize the confidence of output fusions and 
alleles. This is mainly due to the complexity in integrat- 
ing heterogeneous predictions from different sources, 
which are associated with heterogeneous scoring sys- 
tems and precision. With this version in place, we are 
actively working on approaches to re-score breakpoints 
and alleles using a genotype-likelihood framework 
[13,36], which will be implemented in a future version 
of BreakTrans. 

Although BreakTrans can effectively eliminate false 
breakpoints by leveraging the independence of WGS and 
WTS data and the existing knowledge of the human tran- 
scriptome, the quality of the results is clearly dependent 
on the quality of the input. If a large number of false
breakpoints are included and true breakpoints are excluded,
any approach will have difficulty deriving correct answers.
Improving breakpoint accuracy itself is a non-trivial task
given the complexity of the cancer genome and the limitations
of NGS [13]. Therefore, it is important to apply a modular
design that allows problems and efforts to be
distributed. BreakTrans makes it possible to separate the 
problem of breakpoint integration from that of breakpoint 
identification. Further improvement in either area will 
synergistically improve the final results. 

Similar to other programs, BreakTrans requires sufficient 
coverage on both genomic and transcriptomic breakpoints 
to validate an event. Failure to validate an event does not 
necessarily negate its existence. This is a fundamental pro- 
blem in analyzing heterogeneous tumor samples that often 
contain multiple clones of tumor cells [38] - that is, subclo- 
nal breakpoints may not receive sufficient coverage from 
standard bulk sample sequencing. However, as NGS continues
to evolve and its cost continues to fall, it becomes
increasingly feasible to obtain deep coverage on both the
genome and the transcriptome of subclonal cell populations
[38] or even single cells [39].



Summary 

We have developed a bioinformatics tool, BreakTrans, that 
systematically maps gene fusions to GSRs, an application 
that is important for molecular diagnosis and targeted 
therapy. Instead of re-performing breakpoint discovery, 
BreakTrans integrates breakpoint hypotheses from various 
sources using a novel breakpoint graph approach. Our
examination using the WGS and WTS data from breast 
cancer cell-line SK-BR-3 indicates that BreakTrans has 
achieved higher sensitivity and specificity than existing 
approaches. Applying BreakTrans to the 43 breast cancer 
samples in TCGA, we have identified a set of 'genomically 
validated' gene fusions that are promising for further functional
study. As sequencing coverage continues to
increase, we anticipate wide application of BreakTrans in 
both research and clinical settings. 

Materials and methods 

Representing genomic breakpoints 

Existing GSR detection programs such as BreakDancer
and CREST predict individual breakpoints from clusters
of abnormally aligned paired-end reads or soft-clipped
reads. Each breakpoint represents a joining of two DNA
segments (break-ends) that are not adjacent in
the reference genome. These breakpoints can be created
by either simple genomic rearrangements, such as deletion,
insertion, and duplication, or complex genomic rearrangements,
such as chromothripsis or closed-chain
translocation, that create multiple breakpoints [5-7,25].
The resulting relationship between the two break-ends in
the subject genome is called a novel adjacency, as it does
not exist in the reference genome. Such a breakpoint can
be represented using a graphic representation known as a
breakpoint graph [40]. Here, we define a breakpoint
representation in the same vein that is more compact
to use in our context. We define a 'breakpoint string'
to specify exactly how two DNA break-ends are joined
together at the breakpoint (Figure 4). A breakpoint string
consists of two break-ends: an in-end and an out-end.
The in-end represents the end point of a DNA segment
before entering the breakpoint. The out-end represents
the start point of another DNA segment after exiting the
breakpoint. The ends are directional (double stranded).
We use '+' to represent the positive strand and '-' to
represent the negative strand. Each break-end is uniquely
specified by a reference genomic coordinate x (consisting
of a chromosome and a position) and a direction. We use
a score f to quantify the confidence of the existence of
the breakpoint. Popularly used scores include the number
of reads or read pairs spanning the breakpoint or a genotype
likelihood [41]. For notational convenience, we use a
vertical bar '|' to represent the connection between an in-end
and an out-end.

Figure 4 Definition of breakpoints and breakpoint strings. For intra-chromosome rearrangements, four types of breakpoints - (a) null,
(b) jump, (c) inverse and (d) repeat between genomic positions x and y - can be created that involve DNA on either the positive (red arrow) or
the negative (blue arrow) strands. Edges are labeled with a number (for example, 3) representing confidence scores (for example, number of
supporting reads) for the predicted adjacency. Edges without a number or with the number '0' represent reference adjacency (null breakpoints).
For inter-chromosomal rearrangements, four possible novel alleles - (e) A-D, (f) B-C, (g) A-B and (h) C-D - can be created by joining four
breakends (A, B, C, D) from two wild-type alleles (A-C, B-D) through genomic breakpoints x and y. Each allele can be represented in two different
orientations involving combinations of either the positive (red arrow) or the negative (blue arrow) strands. (i) Breakpoint strings corresponding to
the above configurations are listed, where '+' represents the positive strand and '-' the negative strand. The syntax of breakpoint strings is further
explained in the Materials and methods.



The definitions above allow us to specify breakpoints
produced by various types of genomic structural rearrangements
in a consistent and concise format (for
example, x+f|y+). We further define four types of intra-chromosomal
rearrangement breakpoints: 'null', 'jump',
'inverse', and 'repeat' (Figure 4a-d). A 'null' breakpoint
represents no breakpoint between x and y: the
sequence between them is identical to the reference
genome. We use a special score f = 0 to denote such a
'null' breakpoint. A 'jump' breakpoint joins together two
non-adjacent segments on the same strand and skips
the sequence between x and y. A breakpoint resulting
from a deletion can be represented as a 'jump'. An
'inverse' breakpoint joins together two non-adjacent segments
on opposite strands (orientations); it can represent
breakpoints produced by inversions or inverted duplications.
Finally, a 'repeat' breakpoint connects x back to an
upstream position y on the same strand; it can represent
breakpoints produced by tandem duplication. Similarly,
we can use a breakpoint string to represent an inter-chromosomal
breakpoint, resulting from four different
ways of joining the break-ends (Figure 4e-h). Taken
together, breakpoint strings defined by our rules can
encode most, if not all, rearrangement breakpoints.
Similar to DNA, a breakpoint string can be reverse-complemented
by swapping the positions of x and y and
flipping the orientations - that is, x+3|y+ is identical to
y-3|x-, albeit on the opposite strand. This feature allows
us to encode breakpoints as undirected edges while
enabling strand-aware search.
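
To make the breakpoint string convention concrete, the following Python sketch parses a single breakpoint string, reverse-complements it, and assigns one of the breakpoint types described above. It is an illustration only and not the BreakTrans implementation: the names Breakpoint, parse_breakpoint, reverse_complement and classify are hypothetical, and the coordinate test that separates 'jump' from 'repeat' (downstream versus upstream relative to the direction of travel) is our reading of the definitions, since Figure 4 is not reproduced here.

```python
import re
from dataclasses import dataclass

# chrom:pos, strand, score, '|', chrom:pos, strand  (for example 1:100+3|1:500+)
_PATTERN = re.compile(r"^(\w+):(\d+)([+-])(\d+)\|(\w+):(\d+)([+-])$")
_FLIP = {"+": "-", "-": "+"}


@dataclass(frozen=True)
class Breakpoint:
    in_chrom: str
    in_pos: int
    in_strand: str
    score: int
    out_chrom: str
    out_pos: int
    out_strand: str

    def __str__(self):
        return (f"{self.in_chrom}:{self.in_pos}{self.in_strand}{self.score}"
                f"|{self.out_chrom}:{self.out_pos}{self.out_strand}")


def parse_breakpoint(text):
    """Parse one breakpoint string into its in-end, score and out-end."""
    match = _PATTERN.match(text.strip())
    if match is None:
        raise ValueError(f"not a valid breakpoint string: {text!r}")
    c1, p1, s1, score, c2, p2, s2 = match.groups()
    return Breakpoint(c1, int(p1), s1, int(score), c2, int(p2), s2)


def reverse_complement(bp):
    """Swap the two break-ends and flip both strands; the score is kept."""
    return Breakpoint(bp.out_chrom, bp.out_pos, _FLIP[bp.out_strand], bp.score,
                      bp.in_chrom, bp.in_pos, _FLIP[bp.in_strand])


def classify(bp):
    """Label a breakpoint; the jump/repeat coordinate rule is an assumption."""
    if bp.in_chrom != bp.out_chrom:
        return "inter-chromosomal"
    if bp.score == 0:
        return "null"
    if bp.in_strand != bp.out_strand:
        return "inverse"
    increasing = bp.out_pos > bp.in_pos
    # On '+', downstream means larger coordinates; on '-', smaller ones.
    downstream = increasing if bp.in_strand == "+" else not increasing
    return "jump" if downstream else "repeat"


if __name__ == "__main__":
    bp = parse_breakpoint("1:100+3|1:500+")
    print(reverse_complement(bp))  # 1:500-3|1:100-
    print(classify(bp))            # jump
```

With this coordinate rule, a breakpoint and its reverse complement always receive the same label, which is consistent with treating breakpoints as undirected edges.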

Constructing the breakpoint graph 

All of the existing NGS structural variant detection software
outputs breakpoints individually, representing aberrant
adjacencies in the subject genome. We can connect
these breakpoints together to form a breakpoint graph, in
which a node represents a genomic position that either
terminates or leads a break-end, and an edge represents a
breakpoint. The edges are undirected and are specific to
various types of breakpoints, as specified by the breakpoint
strings. In a polyploid genome, multiple alleles
(chromosomes) are present. A node can thereby have
multiple edges, each representing a different allele.
Where no aberrancy is detected, the subject genome is
assumed to have the same allele as the reference genome.
To represent a complete genomic architecture, edges
with null breakpoints are added to represent the reference
alleles that connect the breakpoints. Note that our
representation is different from that used by nFuse, in
which a node represents a break-end on a specified
strand. In our case, a node only represents a position;
whether it leads or terminates a break-end on a specific
strand depends on specifications on the connecting
edges.
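
The graph construction described above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions rather than the BreakTrans code: the function name build_breakpoint_graph and the tuple layout are hypothetical, each breakpoint is given as (in-node, in-strand, score, out-node, out-strand) with a node being a (chromosome, position) pair, and null edges are simply added between neighboring positions on the same chromosome.

```python
from collections import defaultdict

FLIP = {"+": "-", "-": "+"}


def build_breakpoint_graph(breakpoints):
    """Build an adjacency list keyed by position-only nodes.

    Each stored edge is (leave_strand, score, neighbor, arrive_strand).
    An undirected edge is stored at both of its nodes; the copy stored at
    the second node is the reverse complement of the first.
    """
    graph = defaultdict(list)

    def add_edge(a, strand_a, score, b, strand_b):
        graph[a].append((strand_a, score, b, strand_b))
        graph[b].append((FLIP[strand_b], score, a, FLIP[strand_a]))

    for a, strand_a, score, b, strand_b in breakpoints:
        add_edge(a, strand_a, score, b, strand_b)

    # Assumption: reference (null, score 0) edges link each pair of
    # neighboring positions on the same chromosome.
    positions_by_chrom = defaultdict(list)
    for chrom, pos in graph:
        positions_by_chrom[chrom].append(pos)
    for chrom, positions in positions_by_chrom.items():
        positions.sort()
        for left, right in zip(positions, positions[1:]):
            add_edge((chrom, left), "+", 0, (chrom, right), "+")

    return dict(graph)


if __name__ == "__main__":
    # The two non-null breakpoints of the PREX1>CPNE1 path in Table 1.
    calls = [
        (("20", 46795673), "-", 17, ("20", 33925625), "-"),
        (("20", 33923847), "-", 18, ("20", 33679982), "-"),
    ]
    graph = build_breakpoint_graph(calls)
    for node, edges in sorted(graph.items()):
        print(node, edges)
```

Because a node carries no strand, the strand on which a break-end is entered or left is recorded on each edge, which mirrors the distinction from the nFuse representation drawn in the text.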

With a breakpoint graph constructed, the task of 
decoding chromosomal architecture involves identifying 
paths that start at the beginning and extend to the end 
of the chromosome. Identifying the correct paths is 
computationally challenging when the graph contains 
many nodes and edges. 

Transcriptome-guided search 

To achieve accuracy and efficiency, it is desirable to 
simplify the graph. Rather than trying to decode the 
complete genome (global optimization), we can focus on 
expressed regions (local optimization). We ignore read- 
through events, which are out of our current scope, by 
disconnecting the reference allele (null edges) between 
the end nodes of neighboring genes. We can always 
restore these connections if read-through events are of 
interest. 
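
One simple reading of this simplification step is sketched below: null edges are kept only when both end nodes lie in the same gene, and the removed edges are returned so that they can be restored later. The function name `disconnect_read_through` and the `gene_of` lookup (node to gene name) are hypothetical, and dropping every null edge that is not contained in a single gene is a broader rule than the text strictly requires.

```python
def disconnect_read_through(null_edges, gene_of):
    """Split null (reference) edges into those kept and those removed.

    null_edges: iterable of (u, v) node pairs
    gene_of:    dict mapping a node to the gene that contains it
    """
    kept, removed = [], []
    for u, v in null_edges:
        gene_u, gene_v = gene_of.get(u), gene_of.get(v)
        if gene_u is not None and gene_u == gene_v:
            kept.append((u, v))      # reference allele inside one gene
        else:
            removed.append((u, v))   # would allow a read-through path
    return kept, removed


gene_of = {"a": "G1", "b": "G1", "c": "G2", "d": "G2"}
kept, removed = disconnect_read_through(
    [("a", "b"), ("b", "c"), ("c", "d")], gene_of)
print(kept)
print(removed)
```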

Similar to a genomic breakpoint, a fusion (transcrip- 
tome) breakpoint predicted from mapping RNA-seq 
reads to the reference genome can be specified by two 
genomic positions x and y that are located in two differ- 
ent genes. To determine the underlying allele from 
which the fusion is transcribed, we first identify the 
nearest genomic breakpoints $(x_0, y_0)$ downstream of x 
and y in the breakpoint graph. We then start at $x_0$ and 
perform a recursive breadth-first search: 

$$p(x_0) = x_0 + p(n(x_0))$$

where the function $p(x)$ denotes the alleles starting at x, 
$n(x)$ denotes the set of nodes that connect to x, and $+$ 
represents path extension. A path terminates if it hits 
either node $y_0$ or the end of a gene. This search algorithm 
returns all genomic alleles (or breakpoint paths) in the 
breakpoint graph that support a fusion hypothesis. 
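
The recursion can be sketched as follows. For brevity this example is written as a depth-first recursion, which visits paths in a different order than a breadth-first search but enumerates the same set of paths. The function name `enumerate_paths`, the adjacency dictionary, the `gene_ends` set, the rule that a path may not revisit a node, and the `max_depth` cap are all assumptions added for illustration.

```python
def enumerate_paths(adj, x0, y0, gene_ends, max_depth=20):
    """Return every path from x0 that reaches y0.

    adj:       dict mapping a node to an iterable of neighboring nodes
    gene_ends: nodes marking the end of a gene; a path reaching one stops
    """
    supporting = []

    def extend(node, path):
        path = path + [node]          # path extension: x + p(n(x))
        if node == y0:
            supporting.append(path)   # path supports the fusion hypothesis
            return
        if node in gene_ends or len(path) >= max_depth:
            return                    # dead end: gene boundary or depth cap
        for nxt in adj.get(node, ()):
            if nxt not in path:       # assumed: no node is visited twice
                extend(nxt, path)

    extend(x0, [])
    return supporting


adj = {
    "x0": ["a", "b"],
    "a": ["x0", "y0"],
    "b": ["x0", "end", "y0"],
    "end": ["b"],
    "y0": ["a", "b"],
}
for path in enumerate_paths(adj, "x0", "y0", gene_ends={"end"}):
    print(" -> ".join(path))
```

Each returned path corresponds to one candidate genomic allele; an empty result means that no chain of predicted breakpoints connects the two fusion partners under these assumptions.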

Data accessibility 

BreakTrans code is available to download at [42]. The 
SK-BR-3 WGS and RNA-seq data are available in the 
NCBI SRA [SRA:SRP028176]. The TCGA breast cancer 
WGS and RNA-seq data can be obtained through 
dbGaP (accession number phs000178.v7.p6). 

Additional file 1: A list of 3,498 candidate gene fusion breakpoints 
in SK-BR-3, used as input to BreakTrans. 

Additional file 2: A list of 41,615 candidate genomic breakpoints in 
SK-BR-3, used as input to BreakTrans. 

Additional file 3: A list of 40 genomically validated fusion 
breakpoints output by BreakTrans-0.0.6. 

Additional file 4: A list of 1,994 candidate gene fusions predicted 
by nFuse-0.1.4. 

Additional file 5: Figures S1 to S42. Figure S1: PREX1-CPNE1 fusion 
detected from SK-BR-3 WTS data. Figures S2 to S5: integrative 
genomics views (IGVs) of read alignments at the two GSR breakpoints 
(four break-ends) underlying PREX1-CPNE1. Figures S6 to S9: IGVs of 
GSR breakpoints underlying MTBP-SAMD12. Figures S10 and S11: IGVs 
of GSR breakpoints underlying WDR67-ZNF704. Figure S12: whole-genome 
somatic copy number alteration (log2) of A09I. Figure S13: the 
NF1-NLE1 fusions detected in A09I. Figure S14: somatic copy number 
alteration on chromosome 17 of A09I with red vertical lines marking 
the GSR breakpoints that support the NF1-NLE1 fusion. Figures S15 to 
S18: IGVs of the two GSR breakpoints (four break-ends) that underlie 
NF1-NLE1. Figure S19: the PPP1R1B-PIPOX fusion detected in A0D1. 
Figure S20: whole-genome somatic copy number alteration (log2) of 
A0D1. Figure S21: somatic copy number alteration on chromosome 17 
of A0D1 with red vertical lines marking the GSR breakpoints that 
support the PPP1R1B-PIPOX fusion. Figures S22 to S27: IGVs of the three 
GSR breakpoints (six break-ends) that underlie PPP1R1B-PIPOX. Figure 
S28: the PPP3R1-TTC27 fusion detected in A0YG. Figure S29: whole-genome 
somatic copy number alteration (log2) of A0YG. Figure S30: 
somatic copy number alteration on chromosome 2 of A0YG with red 
vertical lines marking the GSR breakpoints that support the 
PPP3R1-TTC27 fusion. Figure S31: zoomed-in view of Figure S29 at the 
chromosome 2 chromothripsis that harbors the fusion. Figures S32 to 
S37: IGVs of the three GSR breakpoints (six break-ends) that underlie 
PPP3R1-TTC27. Figures S38 to S41: PCR validation of TCGA BRCA 
genomic breakpoints. Figure S42: capillary sequencing trace of a PCR 
product that was not visible in the gel. 

Additional file 6: A list of 124 fusion breakpoints (42 gene pairs) 
produced by Tophat-fusion from the validation RNA-seq data. 

Additional file 7: A list of 115 fusion breakpoints (62 gene pairs) 
produced by BreakFusion-1.0.1 from the validation RNA-seq data. 

Additional file 8: Breakpoints and PCR primers for SK-BR-3 
validation. 

Additional file 9: A list of 177 gene fusions and corresponding 
genomic breakpoints produced by BreakTrans from 43 TCGA breast 
cancer samples. 

Additional file 10: Experimental design for validating 20 TCGA GSR 
breakpoints. 

Additional file 11: PCR validation of 20 TCGA GSR breakpoints in 9 
TCGA samples. 